Author
|
Topic: OSS-3 vs Polyscore 5.5
|
Bob Member
|
posted 08-30-2007 11:25 AM
Ray; I am very interested in what you may have learned from John Harris while conversing with him at APA about the differences between OSS-3 and Polyscore 5.5 scoring. Can you expound on your comments regarding a "personally electrifying conversation about statistical things, scoring algorithms..."? Bob IP: Logged |
rnelson Member
|
posted 08-30-2007 02:45 PM
Mr. John Harris is a really good guy. He's obviously brilliant, but approachable and social and willing to talk about this stuff. He wasn't overly critical or aggressive, but asked very good questions, provided information, and had insightful comments. Mostly he was patient with this know-it-all upstart from CO, who probably knows a fraction of what Mr. Harris knows about algorithm development and validation. That is a remarkable thing, considering that Polyscore is a proprietary and money-making thing, which our work may encroach on - because of the free-beer thing. That says an awful lot about his human-ness. More people should be like that. Plus he laughs and seems to enjoy himself. You can't ask for more than that. Talking with him for a few minutes of "geek-speak" was perhaps the highlight of my week at APA.

Alright, enough with blowing sunshine at Mr. Harris. Mr. Harris suggested an n-fold cross-validation for OSS-3. So, I got up early the last few days, completed that yesterday morning, and sent the results to him. N-fold is sometimes referred to as k-fold, and is a technique used in data-mining and neural-networking applications to validate algorithms in the absence of large or good validation samples. It's a kind of validation experiment without a validation sample. Cool, huh? The results were not surprising - OSS-3 works well.

The NAS (2003) report was critical of Polyscore regarding claims that version 5.5 was based on neural-network models while 3.3 and earlier versions were based on logistic regression (logits). Logits use logarithmic values and so does OSS-3, but we do it differently. NAS stated they couldn't tell the difference between Polyscore 3.3 and version 5.0. So, it seems that both claims are correct: the underlying algorithm may not have changed, but he used an n-fold neural-networking model for validation. Now, so have we (the OSS-3 team).

Mr. Harris also informed me that Polyscore uses a two-stage decision rule approach, and I think he may have even said an asymmetrical alpha (not sure). He never described those in the past. Polyscore trims outlier or extreme values at the interquartile range. OSS-3 discards less data, and uses the normal range (2 standard deviations) of the ipsative standardization of all lognormal ratios, by component.

There is no information about how Polyscore handles the inflated alpha and combinatoric addition-rule problems that drive errors and inconclusives in multiple-facet and mixed-issues exams. I assume there is no thought to this in Polyscore, or CPS-II (or Identifi, Chart Analysis, or White Star for that matter) - only because there is no description of these important mathematical concerns. I could be wrong about that, but that is the state of our present proprietary algorithms; no one has really told us how they work, or how or whether all these known complications have been addressed. OSS-3 uses a Bonferroni correction to the alpha to reduce excess INC results and increased errors with spot scoring.

I hear people say erroneously that Polyscore was trained on 10,000 cases. It was not. I have that data. It is about 848 ZCT cases, of which 600-some-odd are marked as usable. Scoring features for Polyscore are based on logistic regression of perhaps 10,000 different features of the 600-odd tests - of which about six features were kept and used. Those features were derived atheoretically, through logistic regression (as in some data-mining applications). They would be meaningless to us examiners unless we learn to think about them.
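For anyone who wants to see what the n-fold (k-fold) cross-validation mentioned above actually does, here is a bare-bones sketch in Python. The sample, the three made-up feature columns, and the logistic-regression classifier are placeholders for illustration only - this is not the OSS-3 or Polyscore code, just the general recipe: split the sample into k folds, train on k-1 of them, test on the one held out, and rotate.

code:
# Hypothetical illustration of n-fold (k-fold) cross-validation.
# The data, features, and classifier are stand-ins, not OSS-3 itself.
import numpy as np
from sklearn.model_selection import KFold
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n_cases = 300                         # pretend archival sample
X = rng.normal(size=(n_cases, 3))     # e.g., three standardized component scores
y = rng.integers(0, 2, size=n_cases)  # 0 = NSR, 1 = SR (confirmed outcome)

accuracies = []
for train_idx, test_idx in KFold(n_splits=10, shuffle=True, random_state=0).split(X):
    model = LogisticRegression().fit(X[train_idx], y[train_idx])
    accuracies.append(model.score(X[test_idx], y[test_idx]))

# Each case is held out exactly once, so the sample validates the model
# without requiring a second, independent validation sample.
print(f"mean held-out accuracy: {np.mean(accuracies):.3f}")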
Atheoretical approaches are not incorrect, and some well-known personality tests were developed atheoretically. OSS-3 uses Kircher features, which were developed from the existing theoretical frameworks and published research about interpretable polygraph features. Kircher and Raskin used discriminant analysis to evaluate a number of features and kept the best in what we now know as those lovable and sensible Utah features. What we call the Kircher features are simply the three most robust of the 12 or so Utah features (respiration line length, electrodermal amplitude, and cardio amplitude). So they make good sense to us humans, and at the same time machines can make good, reliable use of them. Discriminant analysis and logistic regression have similar purposes. Logistic regression is more suited to atheoretical frameworks, non-normal data, and ordinal data. DA can provide more information, but requires continuous normal data. Transformed polygraph data are continuous normal data. Take your pick; it really depends on your expertise and situation which is preferred.

I now know that the OSS-3 training and validation samples (archival samples which we did not construct) are included in the Polyscore training data. I do not know whether those cases were selected from the Polyscore data or were contributed to the Polyscore data. So, Polyscore and OSS-3 (and all other scoring algorithms) are faced with the same set of challenges. We all simply handle those challenges differently.

There is an important difference between OSS-3 and CPS-II, which to me looks like it is basically the Utah approach with a uniform-Bayesian decision model. CPS-II results are base-rate dependent, and the uniform prior for CPS-II is set arbitrarily at .5. In real life we don't know our prior (base rates), and a lot of Bayesian solutions don't use uniform priors even for dichotomous decisions like polygraph results (SR/NSR). So, generalizing Bayesian probabilities to field situations and single cases is theoretically difficult. OSS-3 uses a solution proposed by Gordon Barland in 1985, which is based on the standard cumulative normal (Gaussian) distribution, for which probability results are more easily generalized to field and single-case situations in which the base rate may be unknown.

Everyone should know that OSS-3 didn't suddenly come from nowhere. It is based on over 20 years of work and publication. A number of people have researched and described some good principles and ideas about polygraph scoring. It's not a large volume, but the ideas have been formative for some time. OSS-3 is based on the good work of smart people whose names you know. Gordon Barland (1985) proposed the use of the normal distribution to estimate errors. John Kircher and David Raskin (1988) defined the usable features. Don Krapohl described the use of the R/C ratio, which is a good solution to the need for a dimensionless representation of differential reactivity. Then he and Barry McManus (1999) published the first OSS (handscoring) procedure. That translated well to a computer model due to its understandable and well-documented structure. Stuart Senter and Andrew Dollins (2003) substantially improved our understanding of the effects of decision rules, and their results are consistent with mathematical predictions. None of them by themselves have described a complete solution.
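To make two of those ideas concrete - Krapohl's R/C ratio as a dimensionless index of differential reactivity, and Barland's use of the cumulative normal distribution - here is a toy sketch. The measurements and the "population" mean and standard deviation are invented, and this is not the published OSS-3 computation; it only shows how a log ratio can be standardized and read off the standard normal curve.

code:
# Toy illustration of a dimensionless R/C-style log ratio pushed through the
# standard cumulative normal distribution. All numbers are invented; this is
# not the published OSS-3 procedure.
from math import log, erf, sqrt

def std_normal_cdf(z: float) -> float:
    """Cumulative probability of the standard normal distribution."""
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

rq_response, cq_response = 1.8, 2.6          # hypothetical measurements
log_ratio = log(rq_response / cq_response)   # negative when the CQ response is larger

# Pretend population parameters for the log ratio (illustrative only).
pop_mean, pop_sd = -0.4, 0.5
z = (log_ratio - pop_mean) / pop_sd

print(f"z = {z:.2f}, cumulative normal probability = {std_normal_cdf(z):.3f}")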
Kircher and Harris have come closest so far, but they have proprietary, financial, and intellectual property interests (nothing wrong with that) that prevent them from completely describing what they do. Credit for OSS-3 should go to all of those people. OSS-3 is a product of the polygraph profession, not a single person. We are simply the construction crew that did the math and took advantage of the timely availability of all those good ideas.

I wish I knew more about Identifi and Chart Analysis/White Star, but there is so little available information describing how those methods work (i.e., nothing, nada, zilch, zippo, zero) that I couldn't say anything. The developers certainly have every right to protect their IP, and to make money - but it's also difficult to defend in court if one had to.

In the psychological sciences, it is common that machine scoring models do exactly what humans would do (though neural networking and data mining are changing that a bit). In polygraph, we have tolerated some form of mystified black-box computer algorithm that we are told we either can't understand or have no business understanding. We assume erroneously that the computer is somehow smarter than we are. To that I say - BS. Computers do only what they are told to do and trained to do. With OSS-3, the combination of good ideas from good scientific work seems to work well. In'nit cool how science works when people do their homework, document things, and publish them for you. We get smarter.

We look forward to the completion of a thorough written description of the OSS-3 method from start to finish. There is some info here. http://www.oss3.info/other.xhtml r
------------------ "Gentlemen, you can't fight in here. This is the war room." --(Stanley Kubrick/Peter Sellers - Dr. Strangelove, 1964)
[This message has been edited by rnelson (edited 08-30-2007).] IP: Logged |
rnelson Member
|
posted 08-30-2007 07:52 PM
I neglected earlier to comment on Mr. Nate Gordon's ASIT algorithm. According to Mr. Gordon, ASIT is based on the Horizontal Scoring System (Gordon and Cochetti, 1987; Gordon, 1999), and he cites the work of Honts and Driscoll, who described a very similar method as Rank Order Scoring. Mr. Gordon claims credit, and expresses some dismay at not being cited himself, for providing Honts and Driscoll with the idea of a rank scoring system. The earliest reference I could find for rank scoring of polygraph data is Suzuki, Ohnishi, Matsuno, and Arasuna (1979), but that involved POT tests.

Rank scoring systems are well known in other testing realms, and represent a nonparametric solution to the problems of normality and variability among data values. For example: contestants in a cross-country foot race may have times of 18:06, 18:10, and 21:05 for the first, second, and third place finishers. The first and second place finishers are separated by only 4 seconds, and the second and third place finishers are separated by a matter of minutes. They rank 1, 2, and 3. The space from 1 to 2 is one unit, and the space from 2 to 3 is also one unit - with no consideration for the linear magnitude of that space. Rank systems are ordinal position only. They reduce extreme values and amplify values that might be considered trivial in other systems. That is OK in ranking systems. Miritello (1999) offered a different ranking system based on R/C spot scores, but was careful not to provide any unfounded guidance regarding a decision model or decision thresholds.

If I recall correctly, Mr. Gordon refers to the work of Honts and Driscoll for decision thresholds. Honts and Driscoll, I believe, provided decision guidelines only for grand total results. Mr. Gordon described the ASIT model as using a derivative of the Honts and Driscoll decision thresholds for spot scoring: achieved, if I recall correctly, by dividing the threshold by the number of spots and charts for a recommended total per spot per chart. I have some mathematical questions about that. A decision threshold for a grand total is an inferential mathematical estimation based on an evaluation of the variance and classification accuracy rates of the grand total. Decision thresholds for spot scores should be based similarly, on an evaluation of the variance and classification accuracy of the spot scores. To achieve an inferential estimate, we must know the variance of the spots, and that requires a reasonable understanding of spot-by-spot results. We have strong reasons to anticipate that the variance of spot scores, in their transformed rank values, may be quite different from the variance of the total. There is no description of anyone investigating that, or ever describing a solution.

A further problem exists in the ASIT model, regarding the evaluation of spot scores, because replacing the natural variability of the spots with the uniform nonparametric variance of a rank order system makes the variance of the spot mathematically meaningless. Someone please correct me if I am wrong here. If I am correct that 1) we don't know the variance of the spots, and 2) the spot variance is meaningless anyway, then the ASIT model is not a mathematical model, but a sorting model. That's not entirely a problem, as long as it doesn't pretend to make estimations about errors (which we are increasingly asked to do).
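The foot-race example above translates directly into a few lines of code. This is only the generic rank transformation - not ASIT, Horizontal Scoring, or the Honts and Driscoll procedure - but it shows how the raw magnitudes disappear once ranks are assigned.

code:
# A toy rank transformation, echoing the foot-race example above: raw
# magnitudes are replaced by ordinal positions, so the 4-second gap and the
# multi-minute gap both become a single unit of rank.
finish_times_sec = {"A": 18 * 60 + 6, "B": 18 * 60 + 10, "C": 21 * 60 + 5}

ranked = {name: rank for rank, (name, _) in
          enumerate(sorted(finish_times_sec.items(), key=lambda kv: kv[1]), start=1)}

print(ranked)   # {'A': 1, 'B': 2, 'C': 3}
# The rank spacing is uniform (1 unit) regardless of the raw differences, which
# is exactly why the variance of rank-transformed spot scores carries little
# information about the underlying reactions.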
I believe the Miritello ranking scheme is also susceptible to distortion of the meaning of spots, even though the use of R/C ratios means that RQs and CQs are combined in Miritello ranking (ASIT/Horizontal Scoring ranks RQs and CQs separately). The problem is that there will always be a high rank and a low rank - even when the subject is truthful to all RQs. Again, someone correct me if I've got these concepts wrong.

With OSS-3, the problem of needing to know the variance of the spots is solved by transforming both the overall and spot scores to standardized (z-score) values. Therefore we know that the variance of the spots is identical to the variance of the overall score. That value is 1. Examiners could certainly calculate standardized values by hand, but that would require estimates of population parameters for spots (norms). Most examiners would find that an unbelievable pain in field settings - so we let computers do the heavy lifting with complex math. That is the only reason OSS-3 needs to be a computer scoring algorithm.

If there has been no investigation of, or solution to, these complications with the ASIT/Horizontal Scoring System (regarding spot values), then I believe we should be very cautious about disseminating and teaching a model that is incomplete. In other words, don't teach it if we don't know it works; and if the theory and estimations indicate problems or unsolved/uninvestigated complications - the polygraph profession should be outgrowing that kind of thing by now. Miritello included an advisement about this in the 1999 article on rank scoring of spots.

I think the only other concern I have is that it is generally not advisable, and not well received, to use a single sample as an approximation of the population. Mr. Harris used the n-fold validation method, and we trained OSS-3 using bootstrapping - both methods will provide more robust estimates of population parameters than a single sample. I started to ask Mr. Gordon about these questions, but he didn't have many answers. Someone may have solved these problems, and may know the answers. If so, please let us all know. I do like ranking models. They are elegant and understandable. I would really enjoy knowing the answers to these riddles, because that could open up new and simple opportunities. I just don't know the answers myself at this point. r
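A minimal sketch of the z-score point above, with invented spot norms. The only thing it is meant to show is that once spot scores are standardized against known (or estimated) population parameters, every spot sits on a common scale with variance 1 - which is the part that is painful by hand and trivial for a computer.

code:
# Minimal sketch of why standardization fixes the spot-variance problem. The
# spot scores and "norms" below are invented for illustration.
import numpy as np

spot_scores = np.array([-1.4, 0.3, -2.1])    # hypothetical spot totals, one per RQ
spot_norm_mean, spot_norm_sd = -0.5, 0.9     # hypothetical population parameters

z_spots = (spot_scores - spot_norm_mean) / spot_norm_sd
print(np.round(z_spots, 2))
# Each z value can now be compared against a single decision threshold, and the
# same cumulative-normal machinery used for the grand total applies to each spot.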
------------------ "Gentlemen, you can't fight in here. This is the war room." --(Stanley Kubrick/Peter Sellers - Dr. Strangelove, 1964)
[This message has been edited by rnelson (edited 08-30-2007).] IP: Logged |
Barry C Member
|
posted 09-01-2007 04:53 PM
Nate describes his math somewhat in his Master's thesis from UNISA (on the validation of his FAINT interview), which can be found here: http://etd.unisa.ac.za/ETD-db/theses/available/etd-07052005-123714/unrestricted/00front.pdf I don't think you'll find the answers you're looking for, though, but it's a start. IP: Logged |
rnelson Member
|
posted 09-01-2007 09:24 PM
Thanks Barry. That doesn't have any information pertaining to my questions. I did re-read Honts and Driscoll (1988, I think) - their study of the Rank Order Scoring System (which Mr. Gordon claims credit for). They found rank scoring inferior to traditional spot scoring of mixed issues, though the overall scoring method was equivalent or better. That's my point: the mathematical predictions for spot/mixed scoring with ranks really stink. Krapohl and others looked at a rank system (2001, I think), but I can't recall what they found right now. Something for tomorrow. I'm going to go out on a limb and suggest it is unwise to disseminate and teach unfounded material at a national training session like APA. The responsible thing to do would be to do what Miritello did - stop where our knowledge stops and advise people of what we do not know. In this case we do know - both the mathematical models and the data seem to underperform for rank scoring of spots. r [This message has been edited by rnelson (edited 09-01-2007).] IP: Logged |
Bob Member
|
posted 09-05-2007 12:56 AM
Ray; I apologise for taking so long to thank you for your informative post regarding OSS-3 and Polyscore 5.5. I confess, though, that for me it was just-a-tad (ok, quite a bit) over my head. Your information did cause me to purchase a couple of books on biostatistics though - in hopes of moving myself from being statistically ignorant to becoming statistically aware :-) So your time in providing the information was not wasted.

Can you comment on the 'scoring windows' of OSS-3 vs Polyscore? As I understand Polyscore 3.0 (and I am unaware of any difference to 5.5, and now v6?), after question onset:
(1) there is a 2 second delay (latency period) for the GSR, and the response window being evaluated is for the next 12 seconds. (DACA scoring rules, however - to my understanding - utilize a .5 sec latency period after question onset. Frankly, I don't understand the basis for this; Boucsein ("Electrodermal Activity," page 133) reports normal latencies for exosomatic EDRs are normally between 1 and 2 seconds, but may be up to 5 secs in cases of skin cooling. He does not report any studies showing a latency period of less than 1 second after stimulus onset.)
(2) there is a 2 second delay (latency period) for the thoracic respiration, and the response window being evaluated is for the next 16 seconds - AND it is not being evaluated as respiratory line length. (Am I incorrect about Polyscore and RLL?)
(3) there is a 2 second delay (latency period) for the pulse, and the response window being evaluated is for the next 8 seconds.
(4) there is no delay (latency period) for blood volume changes, and the response window begins at question onset for 8 seconds.
Thanks again for the info you provide. Bob IP: Logged |
Barry C Member
|
posted 09-05-2007 05:42 AM
Bob, if you're going to torment yourself like this, then I suggest you read this doc too: http://www.stoeltingco.com/polygraph/peer/page2.htm Granted, it's put up by the "competition," but it's a starting point for some of the disagreements on how to do it correctly. I'm not sure how they came up with the response latencies. It may be that they scored multiple windows of time and learned those were optimal. If that's the case, some of the criticisms cited in the above report could be very problematic for other data sets. As far as the .5 sec EDA figure goes, that is the absolute minimum amount of time it can take for a reaction in that channel. Anything earlier can't be the result of the stimulus (question) we present. The Utah scoring system, and probably CPS (but I don't remember), consider two latency periods: EDA (.5 second) and PLE (2 seconds). IP: Logged |
rnelson Member
|
posted 09-05-2007 10:36 AM
Bob, I knew that perhaps two people might read that. Thank you for your thoughtful questions.

Barry, the peer review at the Stoelting site is quite interesting for both its content and the degree of latent aggression. Could you explain further about the potential problems with the time window and other data sets?

As I understand them, time windows represent an arbitrary dimension in feature development, though they can be investigated empirically for their effectiveness. I've heard from Barry (on this forum) that Kircher has investigated RLL and recommends 10 seconds. Onset latency is another arbitrary dimension that can be investigated empirically. I have not seen any data on scoring windows or onset latency. There are really two questions that should be considered: 1) does one length (for either window or latency) perform better than others, and 2) is the difference statistically significant? It's possible (perhaps even likely) that one length outperforms others. A less likely result is that those differences by themselves are statistically significant (remember that statistical significance is rare, by definition). What that means to us humans, if the difference is not significant, is that we might not even notice the difference using one length over another. The corollary to that is that even non-significant differences, when combined with other non-significant differences, can add up to something more meaningful. When we are talking about a test that is actually fairly accurate (in the social sciences realm), gaining another point or half-point of percentage in improved accuracy is sometimes really dagone difficult - so we seek to exploit the combinations of improvements which by themselves might be non-significant. (A small sketch of what such a significance test looks like follows the feature list below.)

We did not investigate interpretable features with OSS-3, but instead used the existing foundation of knowledge from Kircher and Raskin (1988). At that time, they described using 20 seconds following the stimulus onset, with a .5 second latency. (I'll double check this.) All that does is provide a scoring window long enough to capture all data that might have been of interest to their discriminant analysis. The .5 second onset latency is again arbitrary, but coupled with the procedure for slope-testing, it will provide a minimal assurance that an early onset (artifact) reaction is not included in the analysis. We all know there will be phenomenological observations with some individuals: late reactions, immediate reactions, reactions during stimulus, reactions after answer, etc. It might be possible to build an algorithm that would evaluate features unique to individual phenomenology, but that would require someone far more propeller-headed than I, and that would require funding. It is perfectly acceptable to select common features that will work with most persons - with the foregone assumption that there may be exceptions and outliers.

I have no information about scoring windows or onset latency for Polyscore features. If you could email or post some information here, that might be interesting. Even a citation would be more than I have at present. The NRC/NAS report (2003) describes the Polyscore 3.0 features as:
- GSR Range
- Blood Volume Derivative - standardized to the 75th percentile
- Upper Respiration - standardized to the 80th percentile
- Pulse Line Length
- Pulse - standardized to the 55th percentile
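Here is the small significance-test sketch promised above. The accuracy counts are fabricated; the point is only that a two- or three-point difference in accuracy on a few hundred cases generally will not reach statistical significance by itself.

code:
# Hedged illustration of the "is the difference significant?" question above:
# a simple two-proportion z-test comparing classification accuracy under two
# scoring-window lengths. The counts are made up for the example.
from math import sqrt, erf

def two_proportion_z(hits_a, n_a, hits_b, n_b):
    p_a, p_b = hits_a / n_a, hits_b / n_b
    pooled = (hits_a + hits_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / se
    p_two_sided = 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))
    return z, p_two_sided

# 89% vs 86% accuracy on 300 cases each - a difference we'd notice in a table,
# but nowhere near conventional significance on samples this size.
z, p = two_proportion_z(267, 300, 258, 300)
print(f"z = {z:.2f}, p = {p:.3f}")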
I would assume that Mr. Harris' logistic regression included an analysis of the contribution of the features listed above, and others, standardized to varying points. It is common to standardize to the mean (average); that way, values above the mean become positive values that indicate the number of standard deviations above the mean, while values below the mean become negative numbers. Standardizing to a specified percentile causes all values above that percentile to remain positive, while values below that percentile become negative numerical values representing the number of standard deviations below the specified percentile (see the small sketch after the next feature list). This might be quite useful in a logistic regression model, in which a certain percentile can become a cut-point for score assignment. Data (Kircher features) for OSS-3 were obtained using a software tool created by John Harris for DACA (DoDPI) during 1994. I have no documentation, but have to assume he followed the recommendations of Kircher and Raskin (1988) in how to obtain those features. That tool provides the Polyscore features as:
- LXD 75th (percentile) = electrodermal
- Blood Volume 80th
- GD Min = electrodermal
- Upper Respiration 65th
- Pulse 10th
- GSR 30th
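As promised above, here is a small sketch of standardizing to a percentile rather than to the mean, with made-up numbers. Whether Polyscore actually computes its percentile-standardized features this way is not documented anywhere I have seen - this is only one plausible reading of phrases like "standardized to the 75th percentile."

code:
# Sketch of the percentile-standardization idea described above, using made-up
# data: center on a chosen percentile instead of the mean, then divide by the
# standard deviation, so the chosen percentile becomes the zero point.
import numpy as np

values = np.array([2.1, 3.4, 1.8, 4.9, 2.7, 3.1, 5.6, 2.2])
anchor = np.percentile(values, 75)         # e.g., the 75th percentile
z_like = (values - anchor) / values.std(ddof=1)

print(np.round(z_like, 2))
# Only the values above the 75th percentile stay positive; everything else is
# expressed as standard deviations below that anchor point.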
You can see there may have been some evolution of features from earlier to later versions of Polyscore. The lower pneumo seems not to be used at all, and the atheoretical development of these features (they work mathematically without regard for any description of construct validity) means that we humans can't easily visualize what they would look like in chart data. Remember that standardizing to those different percentiles, instead of the mean, changes the point at which standardized values switch from positive to negative. I have no idea whether Polyscore standardized to normative or ipsative percentiles. We also know that Polyscore standardizes the RQ data to the mean of CQs - which is simply yet another way to achieve a dimensionless mathematical representation of the differential reactivity between RQs and CQs. From Shawn Edwards (Stoelting product manager) at the APA conference, I learned that Kircher is now recommending some different (presumably optimized) scoring windows, based on his analysis of their data. Here is what I recall:
- RLL = 10 seconds
- EDA = 12.9 seconds
- Cardio = 14.5 seconds
Shawn is hopefully on-belay here, and will chime in and catch me if I am in error (I'd do the same for him, of course). Shawn is a really nice guy (smart and has a sense of humor). We met at APA and laughed about our first interactions in this forum.

Stoelting provides a well-featured software package that includes the ability to edit artifacted data, adjust onset latency, and change scoring windows. This is important, because we recognize that these are arbitrary parameters that must be investigated empirically. We might learn more later, and the ability to make informed adjustments (without re-coding or recompiling the whole darn software package) is nice. But just because you can doesn't mean you should, so no one should make any uninformed changes to their system settings. (I also believe that system settings should be included in report output, for auditing purposes.) The new Limestone software includes advanced features for adjusting onset latency and scoring windows, and they have a powerful little measurement tool that allows for marking and editing artifacted values, while retaining and providing all original/raw data and system settings for auditing and QC. Limestone will also output the raw data (Kircher features) to a text file (.csv) to facilitate easy aggregation of data for researchers. I've spoken with the folks at Lafayette about also providing an output of the raw Kircher features.

One thing to note is that both Limestone and Lafayette designed their first feature extraction tools following the OSS (1) instructions of Dutton (2000), which are incomplete regarding slope testing for early onset reactions and tracings that descend substantially at onset before ascending. This will result in measurement errors. The correct procedures were described by Kircher and Raskin (1988). Limestone has already corrected this, and the programmers at Lafayette have been alerted and assure me the error will be corrected. We've also provided both Limestone and Lafayette with a list of advanced settings that should be available to users of OSS-3 - including things like alpha, Bonferroni, latency, window length, etc.

One important thing to note with Polyscore, CPS-II, and OSS-3 is that results cannot be more reliable than the data you put into those algorithms. Remember: garbage in = garbage out (it's an oldie but a goodie, and it still applies). Both Gordon Barland and Eric Holden have talked about the important differences between "scoring charts" and "analyzing charts." Dr. Barland emphasizes that scoring should begin, but not end, with numerical analysis. I believe it is tempting for many to allow computer scoring algorithms to attempt to interpret cruddy data which we humans find too ugly for comfort - asking ourselves 'what does the computer algorithm say,' while reassuring ourselves that the computer must surely be smarter or do things we couldn't. I've heard examiners say "who am I to disagree with the computer score." I'm nowhere near ready to agree with that, and would instead advise that if you wouldn't score a particular segment, then you should not let the computer score it. Our present technologies for detecting and rejecting artifacted or unusable data are not yet well described, and still appear unsatisfactory. As always, a valid/usable test result depends on a properly administered examination. But that's a whole 'nuther argument. Peace,
r
------------------ "Gentlemen, you can't fight in here. This is the war room." --(Stanley Kubrick/Peter Sellers - Dr. Strangelove, 1964)
[This message has been edited by rnelson (edited 09-05-2007).] IP: Logged |
rnelson Member
|
posted 09-05-2007 11:41 AM
I found this in the 1994 patent document:
- GSR = 11 seconds, beginning 2 sec after stimulus onset
- Pulse line lengths 8 sec, beginning 2 sec after stimulus onset
- Upper respiration 16 sec, beginning 2 sec after stimulus onset
- BV derivative 8 sec, beginning at the stimulus onset
I still have no info on what the respiratory data represent. The numerical output might be transformed RLL, or something else. I don't know. Keep in mind two things: 1) Polyscore features are atheoretical - why they work doesn't matter as much as that they work mathematically, and 2) procedural solutions (like DACA and other standards) often approximate (which includes simplified departures from) empirical data, for the sake of simplicity and reliability - which improves field accuracy. See the other thread regarding first aid training and the treatment of medical shock for an extreme example of this. See the present DACA chart analysis document section on calculating RLL for an example of how a committee can really kludge a solution to the point of senselessness (I doubt whether that can be implemented reliably without continual training, retraining, and QC/supervisory oversight - a simpler procedure would be much easier to implement reliably).
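For what it's worth, here is one simple, generic formulation of respiration line length over a scoring window - summing the point-to-point excursion of the tracing. The sampling rate, the fake signal, and the latency/window figures (borrowed from the patent numbers above) are placeholders, and this is not claimed to match the DACA procedure or whatever Polyscore does internally.

code:
# One simple formulation of respiration line length (RLL): sum the absolute
# point-to-point excursions of the tracing across the scoring window.
# Latency, window length, and sampling rate are placeholders.
import numpy as np

sample_rate_hz = 30
latency_s, window_s = 2.0, 16.0            # e.g., the patent figures quoted above

t = np.arange(0, 25, 1 / sample_rate_hz)
tracing = np.sin(2 * np.pi * 0.3 * t)      # fake respiration signal

start = int(latency_s * sample_rate_hz)
stop = int((latency_s + window_s) * sample_rate_hz)
rll = np.sum(np.abs(np.diff(tracing[start:stop])))

# Suppressed breathing flattens the tracing and shortens the line length.
print(f"RLL over the window: {rll:.2f}")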
You should also remember that RLL is a derivative measurement - respiration activity (physiologically) is not measured through linear tracing length, nor is EDA or cardio activity measured in mm or inches the way we do it. r ------------------ "Gentlemen, you can't fight in here. This is the war room." --(Stanley Kubrick/Peter Sellers - Dr. Strangelove, 1964)
IP: Logged |
Barry C Member
|
posted 09-05-2007 02:06 PM
quote: Barry, the peer review at the Stoelting site is quite interesting for both its content and the degree of latent aggression. Could you explain further about the potential problems with the time window and other data sets?
Ray, I haven't read the peer review in some time, and I think this was one of their criticisms. That is, they suggested the APL crew came up with (presumably, anyhow) optimal scoring windows for that one particular set of data. If they (APL) didn't then test those "optimal" windows on another data set, then we don't know if they found what is optimal, on average, for all data sets, or if they found what is only optimal for their single data set. Kircher's 10 second window for RLL was not arbitrary, according to John Kircher. He tested data at fractions of a second, and it happened to work nicely for this one. He made a point to tell me 9.9 and 10.1 seconds were not as good as an even 10.0. Whether he published on that is a different story. He may have; as I recall, Don Krapohl told me the number wasn't arbitrary, as I once thought it was. He may have read it, or he may have gotten it from the source as I did. According to Shawn, the other scoring windows are what John has found to be optimal as well, although I think more recent data has resulted in an "update" of what's optimal. Shawn, dare you try again? Easy Ray. Back away from the fly swatter. IP: Logged |
Bob Member
|
posted 09-05-2007 03:39 PM
Ray & Barry; AH - this is much closer to what I was interested in knowing. Thanks for your responses. Also (I just ran across some info) regarding changes in latency/response windows in reference to Polyscore 4.0 (1998):
- Respiration: after 1.5 sec latency from Q onset, a response window of 16.5 sec
- EDA: after 1 sec latency from Q onset, a response window of 13 sec
- Pulse: after 1.5 sec latency from Q onset, a response window of 8.5 sec
- Blood volume: no latency period, a response window of 18 sec
I don't know about changes after that (v-5.5 and now v-6?).

The question: What IS OSS-3 using - 20 seconds following the stimulus onset, with a .5 second latency, for each channel? Or is OSS-3 using Kircher's recommendation of some different (presumably optimized) scoring windows, based on his analysis of their data (which you described as: RLL = 10 seconds, EDA = 12.9 seconds, Cardio = 14.5 seconds) - and if so, is there a latency period being applied?

Ray, regarding your earlier comment "The lower pneumo seems not to be used at all" in Polyscore: I posed a question to John Harris via email in 2002 regarding this; he responded, "We have found that the abdominal signal is mostly a noisy version of the thoracic. However, what additional information that is in the abdominal is used." He was asked, "if the thoracic and abdominal tracings were being evaluated separately, was there a reason they did not permit in the software the ability to individually 'tag' for artifact/distortion removal for both P2 and P1?"; John's answer was, "We feel that since the two are so highly correlated, they should be treated the same for artifact removal." Bob IP: Logged |
rnelson Member
|
posted 09-05-2007 04:26 PM
Thanks again, Bob. For training, OSS-3 uses whatever John Harris programmed into the tool he created for DACA (DoDPI) in 1994. There is no documentation, and he seems not to recall from memory. We have to assume he followed the instructions from Kircher and Raskin (1988). I suppose I could spend some time and try to figure it out further. The best suggestion I have is to use Kircher's most recent recommendations. The effect of scoring window length will be to assure adequate sensitivity to the signal of interest, while rejecting as much noise as possible. There are other considerations that are equally if not more important than window length - such as slope and artifact detection. What you are observing in the different Polyscore versions might be the effects of the succession of training and validation experiments with that algorithm.

I don't have any real disagreements with John's response. It would be an equivalent mistake to factor two EDA values into a scoring algorithm. Simply discarding the noisier one is a simple solution. OSS-3 has a different solution, which could theoretically improve the reliability of the data. OSS-3 retains a procedure from previous OSS versions, in which the pneumo score furthest from zero (the mean) is retained, whether upper or lower, except that the value is defaulted to zero if the upper and lower pneumo scores result in opposite mathematical signs. Opposite signs are more likely to occur when the pneumo data are messy, artifacted, or uninterpretable. So, the zero-default/opposite-sign procedure can be considered protective against some noise. Zero values are removed from analysis in OSS-3.

When I looked at the Stoelting software at APA, it was set to .5 seconds of latency. Keep in mind that things like latency and scoring window are what we should consider "blunt" (non-precise or arbitrary) metrics for feature definition. As such, they can be expected to be susceptible to overfitting to a training dataset, which would result in poor generalization to other samples and field data. Blunt parameters should be trained bluntly - so they will work with lots of data. Another example of overfitting with blunt methods was our experiment with a non-parametric OSS-3, in which we used the sign value of standardized measurements only, ignoring the actual magnitude of measurement. We found we could train the algorithm to a high level of accuracy with a single sample, but accuracy fell apart with other samples. This is consistent with information from three- and seven-position scoring research, and from some research (2001, I think) on rank scoring by Krapohl and others. The parametric OSS-3 model, which we have made available and described, uses the magnitude values, and the trained accuracy appears to generalize well to several samples of data.

I just got off a very enjoyable phone call with Shawn Edwards from Stoelting. He's a smart and well-educated person. He has an educational and work background quite similar to my own, as a therapist for a long time before entering polygraph. He's familiar with the standard batteries of IQ and personality tests. Anyone who has ever taken the Miller Analogies Test will enjoy this: Shawn Edwards is to Stoelting as Sue Lutrell is to ...??? CPS software allows the user to adjust the scoring window - even Kircher can appreciate that people may feel compelled to do that. I wouldn't adjust it myself, but would defer to the wisdom of Kircher's conclusions based on his analysis of the data.
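To make the zero-default rule above concrete, here is a minimal sketch of that pneumo-combination logic. The numbers in the examples are invented; the rule itself is just as described: keep whichever pneumo score is furthest from zero, unless the upper and lower scores disagree in sign, in which case default to zero (and zero values are later dropped from the analysis).

code:
# Minimal sketch of the pneumo-combination rule described above: keep whichever
# pneumo score is furthest from zero, but default to zero when the upper and
# lower scores disagree in sign (a hint of noisy or artifacted respiration data).
def combine_pneumo(upper: float, lower: float) -> float:
    if upper * lower < 0:          # opposite mathematical signs
        return 0.0                 # zero values are later dropped from analysis
    return upper if abs(upper) >= abs(lower) else lower

print(combine_pneumo(-1.2, -0.4))  # -1.2 (agreeing signs: keep the larger excursion)
print(combine_pneumo(-1.2, 0.7))   # 0.0  (opposite signs: treat as unusable)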
Feature selection with OSS-3 is a point at which we rest on the assumptions and conclusions from previous research, and make no attempt to investigate it ourselves. John Kircher is a really smart guy who already has a lot of answers regarding those things. He's published and described his work for us, and we should thank him - not reinvent the wheel. Then we get to put our time into solving the next unsolved problems, like transformations and combinatorics regarding mixed-issues exams. r ------------------ "Gentlemen, you can't fight in here. This is the war room." --(Stanley Kubrick/Peter Sellers - Dr. Strangelove, 1964)
IP: Logged |
rnelson Member
|
posted 09-05-2007 05:02 PM
Shawn just sent me an email, correcting my information. It's OK to call out my error here, Shawn. I'm as thick-skinned as anyone. Besides, I'd do the same for you... Here is what I got from a screenshot of the Stoelting software - based on Kircher's present recommendations. Scoring windows:
- Pneumo = 10 seconds, 0.0 sec latency
- EDA = 12.9 seconds, 0.5 sec latency
- Cardio = 15 seconds, 0.0 sec latency
From what Barry has described, I wonder if Kircher constructed a step-wise ROC plot of various scoring window lengths. It'd be fun to know. I don't know for sure, but I assume the same rules apply for slope testing, as described by Kircher and Raskin (1988). Shawn??? r
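Purely for fun, and purely speculative: here is what a step-wise ROC sweep across candidate window lengths might look like. The data and the pretend "scoring function" are fabricated and have nothing to do with Kircher's actual analysis; it is only the shape of the experiment being imagined.

code:
# Speculative sketch only: one way a step-wise ROC comparison of scoring-window
# lengths could be run, using fabricated data.
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(1)
n_cases = 400
labels = rng.integers(0, 2, size=n_cases)        # 0 = truthful, 1 = deceptive

def score_with_window(window_s):
    """Fake scoring function: group separation peaks near a 13-second window
    (purely illustrative, not real data)."""
    separation = 1.0 - abs(window_s - 13.0) / 20.0
    return rng.normal(loc=labels * separation, scale=1.0)

for window_s in np.arange(8.0, 18.5, 0.5):       # step through candidate windows
    auc = roc_auc_score(labels, score_with_window(window_s))
    print(f"window {window_s:4.1f} s  ->  AUC {auc:.3f}")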
------------------ "Gentlemen, you can't fight in here. This is the war room." --(Stanley Kubrick/Peter Sellers - Dr. Strangelove, 1964)
IP: Logged |
Barry C Member
|
posted 09-13-2007 03:22 PM
One of the things that should be pointed out on this topic is that Polyscore - unless something has changed since I last read their help file - only produces "validated" decisions for Bi-zones with four charts - not three. The beauty of OSS-3 is that it'll score just about anything. IP: Logged |
rnelson Member
|
posted 09-16-2007 11:00 AM
Greetings, I'm still 100s of miles from home, but I needed a break from the road. Barry: quote: One of the things that should be pointed out on this topic is that Polyscore - unless something has changed since I last read their help file - only produces "validated" decisions for Bi-zones with four charts - not three. The beauty of OSS-3 is that it'll score just about anything.
Do you know why that is? We designed the mathematical transformations of OSS-3 so that we know the variance of the data regardless of how many RQs and regardless of how many charts. We also know the variance of the spots (equivalent to the variance of the grand mean), which is an important achievement because it allows us to develop mathematical estimations of the likelihood of an erroneous result when making classifications (i.e., decisions) by spot.

I commented earlier about my concerns with rank scoring of spots (whether you want to call it Gordon and Cochetti's Horizontal Scoring or Honts and Driscoll's Rank Order Scoring system - it's still just rank scoring), and the meaninglessness of between-spot variance after rank transformation. A similar, but perhaps less insurmountable, problem exists with spot scoring using traditional scoring methods - we cannot make mathematical estimations regarding the possibility of an error unless we know the variance of the spots. This could be done in traditional scoring if we had access to lots of confirmed spot-by-spot data (but we don't). We also have to appreciate that spot scoring of single-issue exams is a mathematically distinct challenge compared with spot scoring of multi-facet and mixed-issues exams. Presently, we draw little distinction between these, and our methods are "sorting" methods, not mathematical methods. Of course, any examiner could take the time to transform (mathematically) the spot measurements to some standardized dimensionless value for which we can know the variance by spot, but that would be expensive work which computers could do much more quickly and reliably.

We examiners too often engage in a form of verbal or empirical shell game when we conveniently substitute decision accuracy statistics (per lab or field experiments, using sets or samples of data) for accuracy or error estimation of a single live field investigation. They are not the same. The accuracy of a single examination is correctly expressed in terms of the likelihood of an error if a certain classification decision is accepted. This is a bit of a departure from the topic, but it's important.

I would like to know more about why Polyscore would consider a BiZone (U-phase) result valid with four charts and not with three. It seems possible to me that this is a feature of how poorly understood, or misunderstood, that method is. r ------------------ "Gentlemen, you can't fight in here. This is the war room." --(Stanley Kubrick/Peter Sellers - Dr. Strangelove, 1964)
IP: Logged |